Abstract — The Haar discrete wavelet transform (DWT), the 
simplest of all DWTs, has diverse applications in signal 
and image processing. The traditional approach to the 2D 
Haar DWT is a 1D row operation followed by a 1D column 
operation. In 2002, Chen and Liao presented a fast algorithm 
for the 2D Haar DWT based on a segmented matrix. However, 
this method becomes impractical for large images because of 
its high computational requirements. In this paper, we 
implement the segmented matrix algorithm on a low cost 
NVIDIA GPU to speed up the computation. The efficiency of 
our GPU based implementation is measured and compared with 
CPU based algorithms. For an image of size 2560x2560, our 
experimental results show a performance improvement of more 
than a factor of 28.5 over Chen and Liao's CPU based 
segmented matrix algorithm and a factor of 8 over MATLAB's 
wavelet function. 

Index Terms — Haar discrete wavelet transform (DWT), CUDA, 
GPU, segmented matrix algorithm, parallel discrete wavelet 
transform 

I. Background and Introduction 

Discrete wavelet transforms (DWTs) have been used in a 
wide range of signal and image processing applications, such 
as image and video coding (MPEG-4 or JPEG 2000), pattern 
recognition, image watermarking and medical image analysis. 
In the traditional approach, the 2D (two-dimensional) Haar 
DWT is performed in two phases, one row operation and one 
column operation, and the column operation cannot start until 
the row operation is completed. This dependency slows the 
computation significantly. To address this problem, Chen and 
Liao [1] proposed the segmented matrix algorithm, in which 
the computation is performed by data rearrangements and one 
matrix multiplication. This simple algorithm produces the 
same results as the traditional 2D Haar DWT at a much higher 
speed. Moreover, it is highly suitable for parallel 
implementation, as only two rows are involved in the 
computation at a time. 

Nowadays, large images are common due to the availability 
and advancement of image capturing technology. Therefore, 
many wavelet based applications have to process large 
images. Parallel computing is a direct way of meeting these 
high computational requirements. A significant amount of 
work has already been done for all sorts of high performance 
computers, for special purpose hardware [2]-[4], for FPGAs 
[5], [6] and for SIMD architectures [7]. Considerable speedup 
has also been achieved by employing GPUs with OpenGL and 
Cg-based implementations of DWT computations [8]-[10]. 

However, GPU accelerated computation has become especially 
interesting since early 2007, when NVIDIA introduced CUDA 
(Compute Unified Device Architecture) enabled GPUs, which 
offer massive parallel computation power. Providing many 
hundreds of gigaflops of processing power, current GPUs 
exploit parallel computation more efficiently than a CPU 
[11]. 

Harnessed by many researchers, these readily available 
commodity GPUs are providing dramatic computation speedups 
in various research fields. Joaquin Franco, Gregorio Bernabe, 
Juan Fernandez and Manuel E. Acacio [12] achieved a 
significant speedup with NVIDIA's Tesla C870 over Intel's 
Core 2 Quad Q6700 (2.66GHz). Vaclav Simek and Ram Rakesh Asn 
[13] used a CUDA enabled GPU for accelerated 2D wavelet based 
image compression. Recently, Wladimir J. van der Laan, Andrei 
C. Jalba and Jos B.T.M. Roerdink [14] implemented a fast 
hybrid method for 2D DWT on CUDA for both 2D images and 3D 
volume data. 

In this paper, we implement the segmented matrix algorithm 
for the 2D Haar wavelet transform on a low cost, commodity 
GPU. Our objective is to speed up the processing of large 
images without increasing computational complexity or cost. 

II. Traditional Computation 

The Haar DWT is the simplest DWT, since it uses only two low 
pass filter coefficients (1, 1) and two high pass filter 
coefficients (1, -1). The Haar wavelet transform in the 
frequency domain can therefore be obtained by adding and 
subtracting the pixels of an image. The 2D Haar DWT 
decomposes an input image into four sub-bands: one average 
component ($W_{LL}$) and three detail components ($W_{LH}$, 
$W_{HL}$, $W_{HH}$). 

Traditionally, the 2D Haar wavelet transform is accomplished 
by one row operation and one column operation, where the 
result of the row transform is the input of the column 
transform. Fig. 1 shows the 2D Haar wavelet transform of a 
4x4 image. 

Figure. 1. 2D Haar DWT of a 4x4 image by traditional approach. 
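
The row-then-column procedure can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' code: it uses the unnormalized filter coefficients (1, 1) and (1, -1) stated above, and the function names and the sample 4x4 image are assumptions made for the example. The quadrant layout of the output (average component at the top left) is a common convention and is also assumed here.

```python
def haar_rows(image):
    """One 1D Haar step on every row: pairwise sums, then pairwise differences."""
    out = []
    for row in image:
        pairs = list(zip(row[0::2], row[1::2]))
        lows = [a + b for a, b in pairs]     # low pass filter (1, 1)
        highs = [a - b for a, b in pairs]    # high pass filter (1, -1)
        out.append(lows + highs)
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def haar2d_traditional(image):
    """Row operation followed by column operation (unnormalized)."""
    row_result = haar_rows(image)            # must finish before the columns start
    # The column operation is the row operation applied to the transpose.
    return transpose(haar_rows(transpose(row_result)))


# Hypothetical 4x4 test image, chosen only for illustration.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
W = haar2d_traditional(image)
for row in W:
    print(row)
```

The column pass reads the complete output of the row pass, which is the dependency that prevents the two phases from overlapping.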

III. Segmented Matrix Algorithm 

Chen and Liao [1] proposed a computationally fast algorithm 
called the "segmented matrix algorithm", in which the 2D Haar 
DWT is performed by only one matrix multiplication instead of 
two separate 1D transforms. The algorithm proceeds step by 
step as follows. 
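Step 1: Let I be the input image of size m x n. Form 2x2 
sub-blocks $B_{ij}$ from the original image I, where 
$i = 1, \ldots, m/2$ and $j = 1, \ldots, n/2$. For example, 

$$B_{11} = \begin{bmatrix} i_{11} & i_{12} \\ i_{21} & i_{22} \end{bmatrix}, \quad B_{12} = \begin{bmatrix} i_{13} & i_{14} \\ i_{23} & i_{24} \end{bmatrix}$$

Step 2: Z-scan each $B_{ij}$ and generate $(m \times n)/4$ 
row vectors $A_{ij}$, one per sub-block. For example, 

$$A_{11} = \begin{bmatrix} i_{11} & i_{12} & i_{21} & i_{22} \end{bmatrix}$$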

Step 3: Express these row matrices as an intermediate matrix 
M. 
Step 4: Consider filter coefficient matrix 
Step 5: Haar wavelet transform can be divided into four sub- 
The rearrangement of the elements of H into four sub- 
matrices will produce the resultant Haar wavelet transform 
matrix W. 
The rearrangements are as follows: 

a) The elements in the first column of H are filled into 
$W_{LL}$ row by row. 

b) The elements in the second column of H are filled into 
$W_{HL}$ row by row. 

c) The elements in the third column of H are filled into 
$W_{LH}$ row by row. 

d) The elements in the fourth column of H are filled into 
$W_{HH}$ row by row. 
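
The five steps can be sketched end to end in plain Python. Several details here are assumptions: the filter coefficient matrix is not legible in this copy of the paper, so the sketch uses an unnormalized 4x4 matrix F built from the coefficients (1, 1) and (1, -1), with its columns ordered to match rearrangements a) to d). The relation H = M x F, the function names and the sample image are likewise illustrative.

```python
# Assumed filter coefficient matrix (unnormalized); column k feeds sub-band k.
F = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]


def build_M(image):
    """Steps 1-3: Z-scan every 2x2 sub-block into one row of M."""
    m, n = len(image), len(image[0])
    M = []
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            M.append([image[i][j], image[i][j + 1],
                      image[i + 1][j], image[i + 1][j + 1]])
    return M


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def rearrange(H, m, n):
    """Step 5: fill column k of H, row by row, into sub-band k of W."""
    half_m, half_n = m // 2, n // 2
    W = [[0] * n for _ in range(m)]
    # (row offset, column offset) of LL, HL, LH, HH inside W (assumed layout)
    offsets = [(0, 0), (0, half_n), (half_m, 0), (half_m, half_n)]
    for idx, h_row in enumerate(H):
        r, c = divmod(idx, half_n)
        for k, (dr, dc) in enumerate(offsets):
            W[dr + r][dc + c] = h_row[k]
    return W


def haar2d_segmented(image):
    m, n = len(image), len(image[0])
    H = matmul(build_M(image), F)   # the single matrix multiplication
    return rearrange(H, m, n)


# Hypothetical 4x4 test image, chosen only for illustration.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
W = haar2d_segmented(image)
```

Each row of H depends only on one 2x2 sub-block, that is, on two image rows, which is why the rows of H can be computed independently of one another.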

IV. CUDA Implementation 

The CUDA platform is currently attracting enormous attention 
due to its tremendous potential for parallel processing. In 
November 2006, NVIDIA introduced CUDA with a new parallel 
programming model and instruction set architecture to solve 
many complex computational problems very efficiently [11]. 
Each CUDA compliant device is a set of multiprocessor cores, 
where each core has a SIMT (Single Instruction, Multiple 
Thread) architecture. Today, four quad-core CPUs can run only 
16 threads concurrently, whereas the smallest executable 
parallel unit on a CUDA device comprises 32 threads. All CUDA 
enabled NVIDIA GPUs support at least 768 concurrently active 
threads per multiprocessor. Moreover, some GPUs can support 
1,024 or more active threads per multiprocessor [11]. Devices 
with 30 multiprocessors (e.g. the NVIDIA GeForce GTX 280) can 
support more than 30,000 active threads [15]. A good parallel 
implementation of an application on a GPU can achieve more 
than 100 times speedup over sequential execution [16]. 

In the SIMT architecture of CUDA, a portion of a parallel 
application is executed many times independently on different 
data, by many threads running on different processors, at 
any given clock cycle. This parallel portion can be isolated 
into a function called a kernel. A kernel is organized as a 
set of thread blocks, and each thread block is, in turn, 
organized as a three-dimensional array of threads. Threads 
within the same block can cooperate efficiently through 
shared memory and can synchronize with each other. Each 
thread has its own unique thread ID, which is defined by the 
three thread indices threadIdx.x, threadIdx.y and 
threadIdx.z. Each block is identified by a unique 
two-dimensional coordinate given by the CUDA specific 
keywords blockIdx.x and blockIdx.y. All blocks must have the 
same number of threads, organized in exactly the same manner. 
The use of multidimensional identifiers simplifies memory 
addressing of multidimensional data. The block and grid 
dimensions, collectively known as the execution 
configuration, can be set at run-time. 
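
The index arithmetic behind these identifiers can be illustrated with a small Python emulation. The tuples below stand in for CUDA's threadIdx, blockIdx and blockDim, and the function names are hypothetical. The flattening rule x + y*Dx + z*Dx*Dy is the standard CUDA definition of a thread ID within a block of size (Dx, Dy, Dz).

```python
def flat_thread_id(thread_idx, block_dim):
    """Linear thread ID inside one block: x + y*Dx + z*Dx*Dy."""
    x, y, z = thread_idx
    dx, dy, _dz = block_dim
    return x + y * dx + z * dx * dy


def global_position(block_idx, block_dim, thread_idx):
    """Global (row, column) addressed by a thread in a 2D grid of blocks."""
    col = block_idx[0] * block_dim[0] + thread_idx[0]   # x runs along columns
    row = block_idx[1] * block_dim[1] + thread_idx[1]   # y runs along rows
    return row, col


block_dim = (16, 16, 1)                       # a 16x16 block of threads
first = flat_thread_id((0, 0, 0), block_dim)
last = flat_thread_id((15, 15, 0), block_dim)
position = global_position((2, 1), block_dim, (3, 5, 0))
```

Here thread (3, 5) of block (2, 1) addresses row 21, column 35 of the data, which shows how two-dimensional identifiers map directly onto image coordinates.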

In our implementation, we use blocks of 16x16 threads each. 
The grid size is set at run-time according to the size of 
the input image. Our CUDA implementation consists of the 
following steps: 

1. Copy image data from host memory to GPU memory. 

2. Determine the execution configuration. 

3. The GPU executes the kernel to compute the elements of the 
intermediate matrix M on each core in parallel. 

4. The resulting matrix H is computed simultaneously on the 
GPU. 

5. Copy the result from GPU memory to host memory. 
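
Steps 2 to 4 can be mimicked on the CPU with a sequential Python stand-in for the kernel launch. This is only a sketch under stated assumptions, not the authors' kernel: it assigns one thread to each 2x2 sub-block, fuses the construction of a row of M with its multiplication by an assumed unnormalized filter matrix, and uses hypothetical function names. The host-to-device and device-to-host copies of steps 1 and 5 have no counterpart here.

```python
import math

BLOCK = 16  # threads per block edge, matching the 16x16 blocks in the text


def execution_configuration(m, n):
    """Grid size when one thread handles one 2x2 sub-block (assumed mapping)."""
    grid_x = math.ceil((n // 2) / BLOCK)
    grid_y = math.ceil((m // 2) / BLOCK)
    return grid_x, grid_y


def haar_thread(image, H, m, n, block_idx, thread_idx):
    """Work of one emulated thread: one row of M times the assumed filter matrix."""
    j = block_idx[0] * BLOCK + thread_idx[0]    # sub-block column
    i = block_idx[1] * BLOCK + thread_idx[1]    # sub-block row
    if i >= m // 2 or j >= n // 2:
        return                                  # thread lies outside the image
    a, b = image[2 * i][2 * j], image[2 * i][2 * j + 1]
    c, d = image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1]
    H[i * (n // 2) + j] = [a + b + c + d, a - b + c - d,
                           a + b - c - d, a - b - c + d]


def launch(image):
    """Sequential stand-in for a kernel launch; a GPU runs these threads in parallel."""
    m, n = len(image), len(image[0])
    grid_x, grid_y = execution_configuration(m, n)
    H = [None] * ((m // 2) * (n // 2))
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(BLOCK):
                for tx in range(BLOCK):
                    haar_thread(image, H, m, n, (bx, by), (tx, ty))
    return H


grid = execution_configuration(2560, 2560)
H = launch([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
```

Under this mapping a 2560x2560 image needs an 80x80 grid of 16x16 blocks. The bounds check lets the same code handle images whose half-dimensions are not multiples of 16.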

The CPU-based algorithm is implemented on an Intel Pentium 
IV 3.00GHz processor equipped with 512 MB of DDR2 RAM. The 
GPU-based algorithm is tested on an NVIDIA GeForce 8500GT 
graphics card with 16 cores, a maximum of 512 threads per 
block and 512 MB of global memory. 

V. Results and Discussion 

To test the computational efficiency of our GPU based 
segmented matrix algorithm, we used images of different 
sizes as inputs. Fig. 2 shows the one level 2D Haar DWT of 
the 256x256 Lena image using the CPU based and GPU based 
segmented matrix algorithms. For comparison, we also 
considered MATLAB's dwt2() function from the Wavelet Toolbox 
and the CPU implementation of the segmented matrix algorithm. 




Figure. 2. One level 2D Haar DWT using (a) CPU based and (b) 
GPU based segmented matrix algorithm. 

Table I compares the computing time of MATLAB's dwt2(), the 
segmented matrix algorithm on the CPU and the segmented 
matrix algorithm on the GPU for increasing input image sizes. 

TABLE. I. Computation Time Comparison Relative to Image Size 
Table I shows that the performance of the CPU based 
segmented matrix algorithm declined noticeably for large 
images, although it performed better for small images. In 
contrast, the GPU based implementation of this algorithm 
improved performance by factors of more than 10 to 28 for 
images sized 1024x1024 to 2560x2560. Moreover, it performed 
better than MATLAB's wavelet function for all image sizes, 
small and large. Therefore, among the three algorithms, our 
GPU based segmented matrix algorithm performed the best for 
high resolution images. 

However, the main drawback of GPU computation is the 
transfer time between host memory and device memory. Copying 
data from the host's memory to the GPU's global memory takes 
a large fraction of the total execution time. Therefore, if 
we exclude the data transfer time from the execution time, 
we would obtain a significant speedup for large images. 
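
The effect of the transfer overhead can be made explicit with a simple timing model. The symbols are illustrative assumptions rather than quantities reported in Table I: $T_{\text{CPU}}$ is the CPU run time, $T_{\text{kernel}}$ the GPU kernel time, and $T_{\text{h2d}}$ and $T_{\text{d2h}}$ the host-to-device and device-to-host copy times.

```latex
S_{\text{total}} = \frac{T_{\text{CPU}}}{T_{\text{h2d}} + T_{\text{kernel}} + T_{\text{d2h}}},
\qquad
S_{\text{kernel}} = \frac{T_{\text{CPU}}}{T_{\text{kernel}}},
\qquad
\frac{S_{\text{kernel}}}{S_{\text{total}}} = 1 + \frac{T_{\text{h2d}} + T_{\text{d2h}}}{T_{\text{kernel}}}
```

Under this model, the larger the copy times are relative to the kernel time, the more the speedup measured without transfers exceeds the speedup measured with them.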

Conclusions 

The widespread usage of the Haar discrete wavelet transform 
(DWT) has motivated the implementation of a simple and low 
cost GPU based DWT algorithm. Our experimental results show 
that, for an image of size 2560x2560, the GPU based segmented 
matrix algorithm is more than 28.5 times faster than the CPU 
computation, including data transfer. Moreover, this GPU 
based method achieved an approximately 8x speedup over the 
CPU based computation of MATLAB's dwt2() for the same image. 
Given this fast computation, we believe that the ideas 
presented in this paper will find widespread application in 
the processing of large images. 